Skip to content

Comments

feat: improve node syncing reporting#5589

Merged
sudo-shashank merged 17 commits intoChainSafe:mainfrom
akaladarshi:akaladarshi/refactor-forest-sync
Apr 28, 2025
Merged

feat: improve node syncing reporting#5589
sudo-shashank merged 17 commits intoChainSafe:mainfrom
akaladarshi:akaladarshi/refactor-forest-sync

Conversation

@akaladarshi
Copy link
Collaborator

@akaladarshi akaladarshi commented Apr 21, 2025

Summary of changes

Changes introduced in this pull request:

  • Added new SyncStatusReport for tracking the syncing progress
  • SyncStatusReport is the replacement of the SyncState to track forest syncing
  • Removed the SyncState Filecoin.SyncState API
  • Introduced a new API SyncStatus with the name of Forest.SyncStatus to return the sync status report
  • Updated chain_follower to use new SyncStatusReport instead of SyncState.

Reference issue to close (if applicable)

Closes #5539

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

@akaladarshi akaladarshi force-pushed the akaladarshi/refactor-forest-sync branch from ebf05a1 to a365028 Compare April 21, 2025 18:51
@akaladarshi akaladarshi marked this pull request as ready for review April 22, 2025 13:07
@akaladarshi akaladarshi requested a review from a team as a code owner April 22, 2025 13:07
@akaladarshi akaladarshi requested review from hanabi1224 and sudo-shashank and removed request for a team April 22, 2025 13:07
Copy link
Contributor

@lemmih lemmih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Self::Wait { watch } => {
let ticker = Ticker::new(0.., Duration::from_secs(1));
let mut stdout = stdout();
let mut last_lines_printed = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either this should be bool type or we can use a better var name to be clear

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (Renamed it).

print_sync_report_details(&report, &mut current_lines)?;

last_lines_printed = current_lines;
// Break if Synced and not watching
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exit feels more appropriate than Break as a reader

Suggested change
// Break if Synced and not watching
// Exit if synced and not in watch mode.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

fn format_tipset_cids(cids: &str) -> &str {
if cids.is_empty() { "[]" } else { cids }
/// Prints the sync status report details.
/// `line_count` is mutable and incremented for each line printed, used for clearing in `Wait`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// `line_count` is mutable and incremented for each line printed, used for clearing in `Wait`.
/// `line_count` is incremented for each printed line and is used for terminal cleanup (e.g., in `Wait` mode).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely removed this part, now returning the number of printed lines from the print_sync_report_details fn directly.

if cids.is_empty() { "[]" } else { cids }
/// Prints the sync status report details.
/// `line_count` is mutable and incremented for each line printed, used for clearing in `Wait`.
/// Pass `&mut 0` if line counting/clearing is not needed (like in `Status` or final print).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Pass `&mut 0` if line counting/clearing is not needed (like in `Status` or final print).
/// Pass `&mut 0` to disable line counting/clearing (like in `status` or final print).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

check_snapshot_progress(client, true).await
}

/// Handles the initial check for snapshot download if the node is initializing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Handles the initial check for snapshot download if the node is initializing.
/// Checks if snapshot download is required or in progress when the node is initializing.
/// If a snapshot download is in progress, it waits for completion before starting sync monitor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/// Handles the initial check for snapshot download if the node is initializing.
async fn handle_initial_snapshot_check(client: &rpc::Client) -> anyhow::Result<()> {
let initial_report = SyncStatus::call(client, ()).await?;
// Use the public getter method instead of accessing the private field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is not clear to me, like which private field and why ?

Copy link
Collaborator Author

@akaladarshi akaladarshi Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the comment as a reminder to use getters instead of directly accessing private fields during refactoring, but forgot to remove it. Will update it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment.

@sudo-shashank
Copy link
Contributor

sudo-shashank commented Apr 23, 2025

@akaladarshi code changes overall LGTM 👍🏻, requested some minor changes.
Some comments could be more clear and worth mentioning and useful why we are doing something instead of what is obvious from code. Suggested a few 🙂

let fork_info = ForkSyncInfo {
target_tipset_key: last_ts.key().clone(),
target_epoch: last_ts.epoch(),
// The epoch from which sync activities (fetch/validate) need to start for this fork.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment makes more sense in ForkSyncInfo struct than here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

target_sync_epoch_start: first_ts.epoch(),
stage,
validated_chain_head_epoch: current_validated_epoch,
start_time, // Track when this fork's sync task was initiated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

stage,
validated_chain_head_epoch: current_validated_epoch,
start_time, // Track when this fork's sync task was initiated
last_updated: Some(now), // Mark the last update time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, I guess you already have it in ForkSyncInfo

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

}

// Print the status report once, without line counting for clearing
print_sync_report_details(&sync_status, &mut 0)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Option instead of &mut 0 to disable line counting is more suitable here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored it, now returning printed lines from the from function itself.

/// Node is significantly behind the network head and actively downloading/validating.
#[strum(to_string = "Syncing")]
Syncing,
/// Node is close to the network head (e.g., within a configurable threshold like 5 epochs).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the threshold configurable somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's not configurable, I don't think it needs to be configurable because its just for changing the sync status to synced from syncing, earlier it was 10.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, the comment was misleading

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

match stateless_mode {
true => self.set_status(NodeSyncStatus::Offline),
false => {
if time_diff < seconds_per_epoch as u64 * 5 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth commenting why 5, best if we can avoid hardcoding

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a randomly selected value, just like SyncState had 10, I just decrease it to be more closer to node head. will make it a constant but open to suggestion for this value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made it a constant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LesnyRumcajs any suggestions for this value here

Comment on lines 201 to 202
self.set_network_head(network_head_epoch as ChainEpoch);
self.set_epochs_behind(network_head_epoch as i64 - current_chain_head_epoch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as ChainEpoch is same as i64, lets standardise
or handle it when initialising network_head_epoch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or change calculate_expected_epoch return type to ChainEpoch

Copy link
Collaborator Author

@akaladarshi akaladarshi Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everywhere we were converting the returned value of calculate_expected_epoch to i64 so updated calculate_expected_epoch return type to i64.

Comment on lines 180 to 182
if *line_count > 0 {
// Only increment if we are in the Wait command context
*line_count += 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from this it's not clear how we determine we are in Wait context, adding a comment above conditional check will make more sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also using Option::None instead of 0 is better

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this part.

signature for crate::shim::crypto::Signature,
signature_type for crate::shim::crypto::SignatureType,
signed_message for crate::message::SignedMessage,
sync_stage for crate::chain_sync::SyncStage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we had some test in SyncStage, can we have some for SyncStatus?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test was there only for checking if the SyncStage is compatible with the LotusJson for comparing with the Lotus API, but since SyncStatus is specific to Forest we don't need those tests.

And to test SyncStatus we need mocks of all the other components it belongs to (ChainStore, StateManager etc.), I don't think we currently have that. We should though.

Comment on lines 1674 to 1675
match sync_status.get_status() == NodeSyncStatus::Syncing {
true => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when matching on boolean values, I feel If and Else is more suitable and readable

Copy link
Member

@LesnyRumcajs LesnyRumcajs Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I agree, though we already have several occurrences of such style in Forest, so it's not a huge deal. For this particular one, I'd avoid comparing booleans and suggest matching on sync_status.get_status() directly. This way, when a new state is introduced, we'll get a compilation error due to an unhandled enum variant.

Copy link
Collaborator Author

@akaladarshi akaladarshi Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented the match statement. Since we only care about Syncing status and returning error in all other case, I think using wildcard (_) is more appropriate.

@sudo-shashank
Copy link
Contributor

@akaladarshi some more code feedback

CHANGELOG.md Outdated
### Breaking

- [#5559](https://github.com/ChainSafe/forest/pull/5559) Change `Filecoin.ChainGetMinBaseFee` to `Forest.ChainGetMinBaseFee` with read access.
- [#5589](https://github.com/ChainSafe/forest/pull/5589) Replace exiting `Filecoin.SyncState` API with new `Forest.SyncStatus` to track node syncing progress specific to forest.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [#5589](https://github.com/ChainSafe/forest/pull/5589) Replace exiting `Filecoin.SyncState` API with new `Forest.SyncStatus` to track node syncing progress specific to forest.
- [#5589](https://github.com/ChainSafe/forest/pull/5589) Replace existing `Filecoin.SyncState` API with new `Forest.SyncStatus` to track node syncing progress specific to Forest.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

use std::sync::Arc;
use tracing::log;

// Node considered synced if the head is within this many epochs
Copy link
Contributor

@sudo-shashank sudo-shashank Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Node considered synced if the head is within this many epochs
// Node considered synced if the head is within this threshold.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

pub(crate) fn get_status(&self) -> NodeSyncStatus {
self.status.clone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clone can be avoided here

Copy link
Collaborator Author

@akaladarshi akaladarshi Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be negligible since NodeSyncStatus is an enum, anyways added a copy trait in the enum and removed the clone.

/// Node is significantly behind the network head and actively downloading/validating.
#[strum(to_string = "Syncing")]
Syncing,
/// Node is close to the network head (e.g., 5 epochs).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Node is close to the network head (e.g., 5 epochs).
/// Node is close to the network head, within the `SYNCED_EPOCH_THRESHOLD`

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@sudo-shashank sudo-shashank requested a review from lemmih April 24, 2025 07:13
@akaladarshi
Copy link
Collaborator Author

I am also planning to remove the existing SyncSnapshotProgress API and include snapshot tracking directly into our new SyncStatus API.
This will allow us to have snapshot progress tracking and syncing progress in single API. It will done in a subsequent PR since current PR is already very big, @LesnyRumcajs Do you have any thoughts on this?.

Copy link
Member

@LesnyRumcajs LesnyRumcajs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, the output is more understandable than before.

Comment on lines 15 to 16
// Node considered synced if the head is within this threshold.
const SYNCED_EPOCH_THRESHOLD: u64 = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's 2m30s for calibnet and mainnet. Did lower values give you trouble?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see any problem with this, I was experimenting with different numbers to see if there could be any issue but didn't. So just choose 5, since last was 10 (@LesnyRumcajs was there any specific reason for choosing 10?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember exactly. Lower values might have given flaky results in the healthcheck endpoint, i.e., frequent transitions between healthy and unhealthy due to forks. I'm okay with both.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I revert back to 10, It's not much of a gap in terms of waiting for node status to change to sync as it's a one time thing while starting up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert to limit the changes. If someone reports that the difference is too big, we can always revisit this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#[strum(to_string = "Synced")]
Synced,
/// An error occurred during the sync process.
#[strum(to_string = "error")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[strum(to_string = "error")]
#[strum(to_string = "Error")]

use same case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#[strum(to_string = "error")]
Error,
/// Node is configured to not sync (offline mode).
#[strum(to_string = "offline")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[strum(to_string = "offline")]
#[strum(to_string = "Offline")]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

pub(crate) fn set_current_chain_head_key(&mut self, tipset_key: TipsetKey) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid having a structure with private fields and a bunch of setters and getters. If they are meant to be used outside of the struct, mark the fields as pub(crate). Otherwise, it'd be nice to encapsulate this if possible so that the user doesn't have to know the internal structure of it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove the setters and getters and use the pub(crate) directly on the fields.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +201 to +208
let now = Utc::now();
let now_ts = now.timestamp() as u64;
let seconds_per_epoch = state_manager.chain_config().block_delay_secs;
let network_head_epoch = calculate_expected_epoch(
now_ts,
state_manager.chain_store().genesis_block_header().timestamp,
seconds_per_epoch,
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be a separate method somewhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everywhere else we are directly using the calculate_expected_epoch and passing the fields, but here we have introduced the variables just so we can reuse them. I don't think we need to introduce a separate method just for this specific use case.

Copy link
Contributor

@sudo-shashank sudo-shashank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sudo-shashank sudo-shashank added this pull request to the merge queue Apr 25, 2025
@sudo-shashank sudo-shashank removed this pull request from the merge queue due to a manual request Apr 25, 2025
const PARAM_NAMES: [&'static str; 0] = [];
const API_PATHS: BitFlags<ApiPaths> = ApiPaths::all();
const PERMISSION: Permission = Permission::Read;

Copy link
Contributor

@sudo-shashank sudo-shashank Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add DESCRIPTION

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@sudo-shashank sudo-shashank added this pull request to the merge queue Apr 28, 2025
Merged via the queue into ChainSafe:main with commit 8e7c0bb Apr 28, 2025
42 checks passed
@akaladarshi akaladarshi deleted the akaladarshi/refactor-forest-sync branch April 28, 2025 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve sync status reporting

4 participants